L1 regularization path for functional features
Authors

Manuel Loth, Philippe Preux (INRIA, LIFL, CNRS, University of Lille, France), [email protected]

Abstract

We consider the LASSO problem. We propose ECON, a LARS-like algorithm that deals with parametrized features and finds their best parametrization.

1. From LARS to ECON

1.1 LARS

The LASSO problem:
• given N samples (x_i, y_i) ∈ D × R, find ŷ ≡ ∑_{k=0}^{K} w_k φ_k, with φ_k : D → R, that minimizes
    ∑_i (ŷ(x_i) − y_i)² + λ ∑_k |w_k|
• K is not fixed: it has to be adjusted;
• the regularization constant λ: how should its value be set?

The LARS algorithm [1]:
• removes the need to set the value of λ a priori by computing the whole regularization path, that is {(λ, w(λ))}_{λ ∈ R+};
• the l1 regularization yields very sparse solutions (ŷ is very sparse, yet still very accurate);
• the algorithm is very efficient with respect to the number of potential features Φ ≡ {φ_k};
• originally formulated with φ_j ≡ the j-th attribute; kernelized since then (φ_j(·) ≡ κ(x_j, ·), with κ a kernel function).

Idea of LARS:
• λ ← +∞
• set of active features: A ← ∅
• set of potential features: P ← Φ \ A
• compute the bias w_0 ← (1/N) ∑_i y_i and set φ_0 ≡ 1 (the constant feature)
• number of active features: K ≡ |A|
• form ŷ ≡ ∑_{k=0}^{K} w_k φ_k
• while the stopping criteria are not fulfilled:
  – compute the residual r on the training set;
  – check whether the weight of an active feature has been nullified; if so, remove that feature from A and put it back in P;
  – otherwise, select φ_{K+1} ≡ φ*, the feature in P that is most correlated with r;
  – compute its weight w_{K+1}, add it to A, and remove it from P;
  – update the weights of all active features, increment K, and update λ.

Key point of the LARS: the weight w_{K+1} is set so that the newly activated feature is as correlated with the current residual as the other active features are, and not so as to minimize the current residual; compare with the (kernel) basis pursuit algorithm [2]. At each iteration, the change in λ, ∆λ, is computed exactly; the minimum is found by exhaustive search over the potential features.

Miscellaneous, but very important: an active feature may become inactive while riding the regularization path, because its weight goes down to 0 (in magnitude). Such a feature leaves A and returns to P. For λ between the values computed at two subsequent iterations, the weights of the active features vary linearly.

1.2 ECON

Because the minimization involved in the LARS relies on an exhaustive search, the LARS cannot deal with an infinite number of features, nor even with a very large one. The features have hyper-parameters: the φ's are really φ_θ, where θ ∈ Θ ⊂ R^T is a vector of hyper-parameters. Usually, a finite and small set of hyper-parametrizations is chosen a priori, and the LARS is run with them. In contrast, ECON deals with this infinite set of potential features and selects the best combination of features, along with their hyper-parametrizations. The downside of ECON is that while the LARS solves the minimization problem exactly, ECON does not, because in general this problem has no closed-form solution. We therefore have to use a global optimizer as a heuristic to solve the problem numerically. We use DiRect [3], which works by dividing the domain recursively, is guaranteed to converge to the global optimum asymptotically, and returns better solutions the longer it is run. A simplified sketch of such a selection loop is given below.
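The following is a minimal sketch of the idea, not the authors' implementation: Gaussian features parametrized by a center and a width stand in for the φ_θ, plain random search over θ stands in for DiRect, and a greedy matching-pursuit-like weight step stands in for the exact LARS path update. All names and parameter ranges below are illustrative assumptions.

```python
import numpy as np

def gaussian_feature(theta, X):
    """Parametrized feature phi_theta(x): a Gaussian bump.
    theta = (center_1, ..., center_d, log_width); this parametrization is a
    hypothetical example, not the one used in the paper."""
    center, log_width = theta[:-1], theta[-1]
    return np.exp(-np.sum((X - center) ** 2, axis=1) / np.exp(log_width))

def econ_sketch(X, y, n_iter=50, n_candidates=2000, seed=0):
    """Simplified ECON-style loop: at each iteration, compute the residual,
    search for the feature parametrization theta most correlated with it,
    then take a greedy weight step.  Random search replaces DiRect, and the
    greedy (matching-pursuit-like) step replaces the exact LARS path update."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w0 = y.mean()                               # bias w_0 = (1/N) sum_i y_i
    residual = y - w0
    active = []                                 # list of (theta, weight) pairs
    for _ in range(n_iter):
        # candidate parametrizations: centers drawn from the data,
        # log-widths drawn uniformly (both ranges are arbitrary choices)
        centers = X[rng.integers(N, size=n_candidates)]
        log_widths = rng.uniform(-3.0, 2.0, size=(n_candidates, 1))
        candidates = np.hstack([centers, log_widths])
        # select the candidate most correlated (in magnitude) with the residual
        best_theta, best_score = None, -1.0
        for theta in candidates:
            phi = gaussian_feature(theta, X)
            score = abs(phi @ residual) / (np.linalg.norm(phi) + 1e-12)
            if score > best_score:
                best_theta, best_score = theta, score
        # greedy least-squares weight for the selected feature
        phi = gaussian_feature(best_theta, X)
        w = (phi @ residual) / (phi @ phi)
        active.append((best_theta, w))
        residual = residual - w * phi
    def predict(X_query):
        out = np.full(len(X_query), w0)
        for theta, w in active:
            out += w * gaussian_feature(theta, X_query)
        return out
    return predict
```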
The upsides of ECON are nevertheless numerous:
• like the LARS itself, it does not require an a priori choice of the regularization constant;
• ECON has the unique ability to select the best, or at least a good, hyper-parametrization of the features it makes active;
• it is rather fast and efficient in practice;
• once the kernel has been chosen, there is no parameter to hand-tune;
• it produces very sparse solutions that seem to capture the complexity of the dataset quite well (K saturates even as more and more data are acquired).

2. Experiments

2.1 A toy classification problem

We use the two-spirals problem and compare an SVM to ECON. For this problem, we actually use 3 datasets: the original one, made of 194 points, and two others with 500 and 1000 points, each point belonging to one of the 2 spirals. In each dataset, half of the examples belong to each class. We perform 300 iterations of ECON on the whole dataset.

dataset size 194 50
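To make the experimental setup concrete, here is a rough usage example of the sketch above on a standard two-spirals dataset. The paper does not describe its data generator, so the construction below (and the noise level, seed, and ±1 label encoding) are assumptions.

```python
import numpy as np

def two_spirals(n, noise=0.1, seed=0):
    """Standard two-spirals construction with labels in {-1, +1};
    half of the points belong to each class.  This is an illustrative
    generator, not necessarily the one used in the paper."""
    rng = np.random.default_rng(seed)
    t = np.sqrt(rng.uniform(0.25, 1.0, n)) * 3.0 * np.pi
    labels = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)
    points = np.c_[t * np.cos(t), t * np.sin(t)] * labels[:, None]
    points += rng.normal(scale=noise, size=points.shape)
    return points, labels

X, y = two_spirals(194)
predict = econ_sketch(X, y, n_iter=300)          # sketch defined above
accuracy = np.mean(np.sign(predict(X)) == y)     # classify by the sign of y-hat
print(f"training accuracy of the sketch: {accuracy:.2f}")
```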
Similar papers
L1-regularization path algorithm for generalized linear models
We introduce a path following algorithm for L1-regularized generalized linear models. The L1-regularization procedure is useful especially because it, in effect, selects variables according to the amount of penalization on the L1-norm of the coefficients, in a manner that is less greedy than forward selection–backward deletion. The generalized linear model path algorithm efficiently computes so...
The Feature Selection Path in Kernel Methods
The problem of automatic feature selection/weighting in kernel methods is examined. We work on a formulation that optimizes both the weights of features and the parameters of the kernel model simultaneously, using L1 regularization for feature selection. Under quite general choices of kernels, we prove that there exists a unique regularization path for this problem, that runs from 0 to a statio...
A Semismooth Newton Method for L1 Data Fitting with Automatic Choice of Regularization Parameters and Noise Calibration
This paper considers the numerical solution of inverse problems with a L1 data fitting term, which is challenging due to the lack of differentiability of the objective functional. Utilizing convex duality, the problem is reformulated as minimizing a smooth functional with pointwise constraints, which can be efficiently solved using a semismooth Newton method. In order to achieve superlinear con...
Linearized Bregman for l1-regularized Logistic Regression
Sparse logistic regression is an important linear classifier in statistical learning, providing an attractive route for feature selection. A popular approach is based on minimizing an l1-regularization term with a regularization parameter λ that affects the solution sparsity. To determine an appropriate value for the regularization parameter, one can apply the grid search method or the Bayesian...
Large-scale Inversion of Magnetic Data Using Golub-Kahan Bidiagonalization with Truncated Generalized Cross Validation for Regularization Parameter Estimation
In this paper a fast method for large-scale sparse inversion of magnetic data is considered. The L1-norm stabilizer is used to generate models with sharp and distinct interfaces. To deal with the non-linearity introduced by the L1-norm, a model-space iteratively reweighted least squares algorithm is used. The original model matrix is factorized using the Golub-Kahan bidiagonalization that proje...
Learning Combination Features with L1 Regularization
When linear classifiers cannot successfully classify data, we often add combination features, which are products of several original features. The searching for effective combination features, namely feature engineering, requires domain-specific knowledge and hard work. We present herein an efficient algorithm for learning an L1 regularized logistic regression model with combination features. W...